Stroke remains one of the leading global causes of mortality and disability, underscoring the urgent need for more accurate and timely risk prediction. While existing studies have applied machine learning (ML) models to public datasets, their performance is often limited by class imbalance and low variability, leaving a gap in clinically reliable solutions. This paper addresses that gap by evaluating stroke prediction using both a real-time clinical dataset collected from SKIMS and SMHS hospitals in Kashmir and a widely used Kaggle dataset. Our methodology involved comprehensive preprocessing (normalization, handling missing values, and class balancing), feature engineering, and implementation of multiple ML algorithms (logistic regression, decision trees, random forests, gradient boosting, XGBoost, and CatBoost) alongside a deep neural network (DNN). Models were trained and tuned within an experimental Python framework using Scikit-learn and Keras, incorporating techniques such as cross-validation and early stopping. Ensemble methods achieved near-perfect accuracy on the real-time hospital dataset but only modest balanced accuracy (~51.1%) on the Kaggle dataset, highlighting the effect of dataset characteristics. Notably, the DNN outperformed classical ML models on the Kaggle dataset, reaching 89.6% test accuracy and demonstrating improved sensitivity for stroke detection. These findings emphasize the importance of dataset quality in stroke risk prediction and suggest that deep learning approaches may offer greater clinical applicability than traditional ML methods.
Introduction
Stroke, which occurs when the blood supply to the brain is disrupted, is a major global health concern and a leading cause of death and disability. Early detection is critical, especially in regions with limited healthcare access, because delays in diagnosis often lead to irreversible damage. Traditional risk prediction methods (such as the Framingham model) rely on fixed, linear factors and fail to capture complex relationships in diverse populations.
To overcome these limitations, this study explores the use of machine learning (ML) and deep learning (DL) techniques for stroke prediction. The research utilizes two datasets: a public Kaggle dataset and a real-time clinical dataset from hospitals in Kashmir (SKIMS and SMHS), enabling both standardized comparison and real-world applicability.
The methodology involves:
Data preprocessing (handling missing values, encoding, scaling)
Exploratory data analysis (EDA) to identify patterns
Feature engineering based on domain knowledge
Training multiple ML models such as Logistic Regression, Decision Trees, Random Forest, KNN, Naive Bayes, and Gradient Boosting
Developing a Deep Neural Network (DNN) for capturing complex nonlinear relationships
Addressing class imbalance using techniques like oversampling and class weighting
Evaluating models using metrics like accuracy, precision, recall, and F1-score
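The class-imbalance step in the list above can be illustrated with a minimal sketch. It is not the pipeline used in this study: it assumes binary labels (1 = stroke, 0 = no stroke), relies only on the Python standard library, and the function names balanced_class_weights and random_oversample are hypothetical. The weighting rule n_samples / (n_classes * class_count) is the common inverse-frequency heuristic, the same one Scikit-learn applies when class_weight="balanced" is set.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until each class matches the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, cnt in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - cnt):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data (illustrative only, not from either dataset):
# 95 non-stroke records and 5 stroke records.
labels = [0] * 95 + [1] * 5
rows = [[i] for i in range(100)]

weights = balanced_class_weights(labels)
over_rows, over_labels = random_oversample(rows, labels)

print(weights)               # the minority class receives the larger weight
print(Counter(over_labels))  # both classes now have the same count
```

In practice, oversampling is usually applied to the training split only, after the train/test split, so that duplicated minority records do not leak into the evaluation data.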
The study emphasizes that accuracy alone is insufficient due to the rarity of stroke cases; metrics like recall are more clinically important to avoid missing true stroke patients.
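A small worked example makes this point concrete. The counts below are hypothetical, not results from either dataset: a cohort of 1,000 patients, 50 of whom had a stroke, scored against a degenerate classifier that always predicts "no stroke". The helper name binary_metrics is likewise an assumption for illustration.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical cohort: 950 non-stroke and 50 stroke patients.
# A classifier that always predicts "no stroke" produces no positives at all.
always_negative = binary_metrics(tp=0, fp=0, tn=950, fn=50)
print(always_negative)
```

Here accuracy is 0.95 even though recall is 0, meaning every stroke patient is missed. Balanced accuracy is 0.5, the chance level for a two-class problem, which is why it is a more informative summary than raw accuracy when stroke cases are rare.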
Key contributions of the research include:
Comparing traditional ML models with deep learning approaches
Evaluating performance across both global and local datasets
Incorporating real-world clinical variability into model design
Proposing a scalable, AI-based system for early stroke risk detection
Overall, the research demonstrates that AI-driven models can significantly improve early stroke prediction, enabling better preventive care, efficient resource allocation, and reduced healthcare burden, while bridging the gap between theoretical models and real-world clinical use.
Conclusion
This paper investigated stroke prediction with both traditional machine learning (ML) and deep learning (DL) techniques on two datasets: a real-time dataset gathered from SKIMS and SMHS hospitals and a publicly accessible Kaggle dataset. It showed that robust prediction systems capable of identifying people at risk of stroke can be built with the right preprocessing, feature engineering, and model selection. On the hospital dataset, which was more clinically rich and context-specific, most ensemble algorithms (Random Forest, XGBoost, CatBoost, AdaBoost, and Gradient Boosting) achieved near-perfect accuracy (close to 100%). The Kaggle dataset was a more demanding setting because of its class imbalance and noisy data. Although their performance there was middling (balanced accuracy of about 51%), ensemble approaches such as Gradient Boosting and AdaBoost still produced better results than the other classical models studied, which illustrates the difficulty of working with uncurated public data. Neural network models showed promise on the Kaggle dataset, reaching an overall test accuracy of about 89.6% and offering greater sensitivity to stroke cases. They also generalized better than the classical models, particularly in predicting the minority class. Owing to better data quality, focused feature design, and domain-specific factors, the models trained on the hospital dataset ultimately performed noticeably better than those trained on the Kaggle dataset. These results show that models built on well-curated real-world data can substantially improve prediction outcomes. Furthermore, such AI-powered diagnostic tools could benefit healthcare settings with limited resources by enabling prompt intervention and potentially saving lives.